hypergeometric distribution

Confidence sequences for sampling without replacement

Ian Waudby-Smith

Neural Information Processing Systems

We present a generic approach to constructing a frequentist confidence sequence (CS) using Bayesian tools, based on the fact that the ratio of the prior to the posterior at the ground truth is a martingale. We then present Hoeffding- and empirical-Bernstein-type time-uniform CSs and fixed-time confidence intervals for sampling without replacement (WoR), which improve on previous bounds in the literature and explicitly quantify the benefit of WoR sampling.
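
As a toy illustration of the prior-to-posterior ratio idea, the sketch below simulates the martingale for a conjugate Beta-Bernoulli model under i.i.d. sampling; the model, prior parameters (`a0`, `b0`), horizon `T`, and thresholds are illustrative assumptions, not the paper's without-replacement construction.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
p_true, a0, b0, T, reps = 0.3, 2.0, 2.0, 50, 20_000

x = rng.random((reps, T)) < p_true                    # Bernoulli(p_true) data
heads = x.sum(axis=1)
prior = beta.pdf(p_true, a0, b0)                      # prior density at the truth
post = beta.pdf(p_true, a0 + heads, b0 + T - heads)   # conjugate posterior at the truth
ratio = prior / post                                  # nonnegative martingale at time T

# The ratio has expectation 1 under the true parameter, and by Ville's
# inequality it exceeds 1/alpha with probability at most alpha; inverting
# this event over all times yields a time-uniform confidence sequence.
print(ratio.mean())                # close to 1
print((ratio > 1 / 0.05).mean())   # at most about 0.05
```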


On Design Choices in Similarity-Preserving Sparse Randomized Embeddings

Kleyko, Denis, Rachkovskij, Dmitri A.

arXiv.org Artificial Intelligence

Expand & Sparsify is a principle observed in anatomically similar neural circuits found in the mushroom body (insects) and the cerebellum (mammals). Sensory data are projected randomly to a much higher-dimensional space (expand part), where only a few of the most strongly excited neurons remain active (sparsify part). This principle has been leveraged to design the FlyHash algorithm, which forms similarity-preserving sparse embeddings that have been found useful for tasks such as novelty detection, pattern recognition, and similarity search. Despite its simplicity, FlyHash has a number of design choices to be set, such as preprocessing of the input data, the choice of sparsifying activation function, and the formation of the random projection matrix. In this paper, we explore the effect of these choices on the performance of similarity search with FlyHash embeddings. We find that the right combination of design choices can lead to drastic differences in search performance.
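
To make the expand-and-sparsify recipe concrete, here is a minimal NumPy sketch in the spirit of FlyHash; the expansion dimension, projection density, and number of winners `k` are illustrative assumptions rather than the settings studied in the paper.

```python
import numpy as np

def flyhash(X, expand_dim=2048, k=32, density=0.1, seed=0):
    """Sketch of an expand-and-sparsify embedding in the spirit of FlyHash.

    X: (n_samples, d) input data.
    expand_dim: dimensionality of the random expansion (>> d).
    k: number of winners kept per sample (sparsify step).
    density: fraction of nonzero entries in the random projection.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Expand: sparse binary random projection to a much higher dimension.
    M = (rng.random((d, expand_dim)) < density).astype(float)
    Y = X @ M
    # Sparsify: winner-take-all, keep only the k most strongly excited units.
    H = np.zeros_like(Y)
    top_k = np.argpartition(Y, -k, axis=1)[:, -k:]
    np.put_along_axis(H, top_k, 1.0, axis=1)
    return H

# Example: every embedding has exactly k active units.
X = np.random.default_rng(1).random((5, 64))
print(flyhash(X).sum(axis=1))
```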


Theoretical Proportion Label Perturbation for Learning from Label Proportions in Large Bags

Kubo, Shunsuke, Matsuo, Shinnosuke, Suehiro, Daiki, Terada, Kazuhiro, Ito, Hiroaki, Yoshizawa, Akihiko, Bise, Ryoma

arXiv.org Artificial Intelligence

Learning from label proportions (LLP) is a form of weakly supervised learning that trains an instance-level classifier from the label proportions of bags, which are sets of instances without instance-level labels. A challenge in LLP arises when the number of instances in a bag (the bag size) is large, making traditional LLP methods impractical due to GPU memory limitations. This study aims to develop an LLP method capable of learning from bags of large size. In our method, smaller bags (mini-bags) are generated by sampling instances from the large original bags, and these mini-bags are used in place of the original bags. However, the proportion of a mini-bag is unknown and differs from that of the original bag, leading to overfitting. To address this issue, we propose a perturbation method for the proportion labels of sampled mini-bags to mitigate overfitting to noisy label proportions. The perturbation is drawn from the multivariate hypergeometric distribution, which statistically models the sampling process. Additionally, loss weighting is implemented to reduce the negative impact of proportions sampled from the tail of the distribution. Experimental results demonstrate that the proportion label perturbation and loss weighting achieve classification accuracy comparable to that obtained without sampling. Our code is available at https://github.com/stainlessnight/LLP-LargeBags.
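
The perturbation idea can be illustrated with a small sketch (not the authors' released implementation) that draws a mini-bag's class counts from the multivariate hypergeometric distribution implied by sampling without replacement from the original bag; the helper name and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_minibag_proportion(bag_proportion, bag_size, minibag_size):
    """Perturb a mini-bag's proportion label via the multivariate
    hypergeometric distribution (sampling without replacement).

    bag_proportion: class proportions of the original (large) bag.
    bag_size: number of instances in the original bag.
    minibag_size: number of instances sampled into the mini-bag.
    """
    # Approximate per-class counts in the original bag.
    counts = np.round(np.asarray(bag_proportion) * bag_size).astype(int)
    # Class counts of a mini-bag drawn without replacement follow a
    # multivariate hypergeometric distribution.
    drawn = rng.multivariate_hypergeometric(counts, minibag_size)
    return drawn / minibag_size

print(perturbed_minibag_proportion([0.7, 0.2, 0.1], bag_size=10_000, minibag_size=64))
```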


On Efficient and Statistical Quality Estimation for Data Annotation

Klie, Jan-Christoph, Haladjian, Juan, Kirchner, Marc, Nair, Rahul

arXiv.org Artificial Intelligence

Annotated datasets are an essential ingredient for training, evaluating, comparing, and productionizing supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect, but checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; their sizes are chosen mostly without justification or regard to statistical power and, more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs money that could be better spent, for instance on more annotations. Therefore, we first describe in detail how to use confidence intervals to find the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
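
As a rough illustration of the sample-size question, the sketch below searches for the smallest sample whose normal-approximation confidence interval, with a finite population correction for sampling without replacement, meets a target half-width; the formula and defaults are textbook choices assumed here for illustration, not necessarily the procedure recommended in the paper.

```python
import math

def minimal_sample_size(population, margin, p=0.5, z=1.96):
    """Smallest sample size whose approximate confidence interval for an
    error rate has half-width <= `margin`, using a finite population
    correction. p=0.5 is the worst case for the variance; z=1.96 gives
    roughly 95% confidence. These defaults are illustrative.
    """
    for n in range(1, population + 1):
        fpc = (population - n) / (population - 1) if population > 1 else 0.0
        half_width = z * math.sqrt(p * (1 - p) / n * fpc)
        if half_width <= margin:
            return n
    return population

# e.g. estimating the error rate of 10,000 annotations to within +/- 5%
print(minimal_sample_size(population=10_000, margin=0.05))
```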


Estimating Unknown Population Sizes Using the Hypergeometric Distribution

Hodgson, Liam, Bzdok, Danilo

arXiv.org Machine Learning

The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. We develop our approach to account for a data generating process where the ground truth is a mixture of distributions conditional on a continuous latent variable, such as in collaborative filtering, using the variational autoencoder framework. Simulation experiments demonstrate that our method outperforms other likelihood functions used to model count data, both in the accuracy of the population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.
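
For intuition, the following sketch evaluates the multivariate hypergeometric log-likelihood that the abstract proposes to maximize over unknown category sizes; the VAE machinery of the paper is not reproduced, and the function name and toy numbers are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def mvhg_log_likelihood(counts, category_sizes):
    """Log-likelihood of observed counts under a multivariate
    hypergeometric model with hypothesised category population sizes.

    counts: observed draws per category (sampling without replacement).
    category_sizes: candidate population size of each category.
    """
    counts = np.asarray(counts)
    K = np.asarray(category_sizes, dtype=float)
    N, n = K.sum(), counts.sum()

    def log_comb(a, b):
        # log of the binomial coefficient C(a, b) via log-gamma
        return gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)

    return log_comb(K, counts).sum() - log_comb(N, n)

# Toy example: score two candidate population configurations.
obs = [30, 15, 5]
print(mvhg_log_likelihood(obs, [300, 150, 50]))   # proportions match the sample
print(mvhg_log_likelihood(obs, [150, 150, 200]))  # mismatched proportions score lower
```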


Overdrawing Urns using Categories of Signed Probabilities

Jacobs, Bart, Stein, Dario

arXiv.org Artificial Intelligence

For drawing (multiple) coloured balls from a statistical urn, we distinguish three well-known modes: 1. hypergeometric or draw-and-delete: drawing a ball from the urn without replacement, so that the urn shrinks; 2. multinomial or draw-and-replace: drawing with replacement, so that the urn remains the same; 3. Pólya or draw-and-duplicate: drawing a ball from the urn and replacing it together with an additional ball of the same colour, so that the urn grows. Multinomial and Pólya draws may be of arbitrary size, but hypergeometric draws are limited in size by the number of balls in the urn. In this paper we lift this limitation and allow hypergeometric draws of arbitrary size, including 'overdraws' containing more balls than are in the urn. Physically this is strange, but, as we will show, mathematically it makes sense once we allow negative probabilities. Negative probabilities have emerged in quantum physics (e.g. in double-slit experiments) and have been discussed in the work of famous physicists such as Wigner, Dirac, and Feynman.
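
The three draw modes can be illustrated with a direct simulation (which, unlike the paper's categorical treatment, cannot express overdraws or negative probabilities); the urn representation and function name below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(urn, k, mode):
    """Draw k balls from an urn given as {colour: count}, under one of the
    three modes described above."""
    urn = dict(urn)
    drawn = []
    for _ in range(k):
        colours = list(urn)
        counts = np.array([urn[c] for c in colours], dtype=float)
        c = str(rng.choice(colours, p=counts / counts.sum()))
        drawn.append(c)
        if mode == "hypergeometric":   # draw-and-delete: urn shrinks
            urn[c] -= 1
        elif mode == "polya":          # draw-and-duplicate: urn grows
            urn[c] += 1
        # "multinomial" (draw-and-replace): urn unchanged
    return drawn

urn = {"red": 3, "blue": 2}
for mode in ["hypergeometric", "multinomial", "polya"]:
    print(mode, draw(urn, 4, mode))
```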


Differentiable Random Partition Models

Sutter, Thomas M., Ryser, Alain, Liebeskind, Joram, Vogt, Julia E.

arXiv.org Artificial Intelligence

Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by first inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.


Learning Group Importance using the Differentiable Hypergeometric Distribution

Sutter, Thomas M., Manduchi, Laura, Ryser, Alain, Vogt, Julia E.

arXiv.org Artificial Intelligence

Partitioning a set of elements into subsets of a priori unknown sizes is essential in many applications. These subset sizes are rarely explicitly learned, be it the cluster sizes in clustering applications or the number of shared versus independent generative latent factors in weakly-supervised learning. Probability distributions over correct combinations of subset sizes are non-differentiable due to hard constraints, which prohibit gradient-based optimization. In this work, we propose the differentiable hypergeometric distribution. The hypergeometric distribution models the probability of different group sizes based on their relative importance. We introduce reparameterizable gradients to learn the importance between groups and highlight the advantage of explicitly learning the size of subsets in two typical applications: weakly-supervised learning and clustering. In both applications, we outperform previous approaches, which rely on suboptimal heuristics to model the unknown size of groups.

Many machine learning approaches rely on differentiable sampling procedures, of which the reparameterization trick for Gaussian distributions is the best known (Kingma & Welling, 2014; Rezende et al., 2014). The non-differentiable nature of discrete distributions has long hindered their use in machine learning pipelines with end-to-end gradient-based optimization. Only the concrete distribution (Maddison et al., 2017) or Gumbel-Softmax trick (Jang et al., 2016) boosted the use of categorical distributions in stochastic networks. Unlike the high-variance gradients of score-based methods such as REINFORCE (Williams, 1992), these works enable reparameterized and low-variance gradients with respect to the categorical weights. Despite enormous progress in recent years, the extension to more complex probability distributions is still missing or comes with a trade-off regarding differentiability or computational speed (Huijben et al., 2021). The hypergeometric distribution plays a vital role in various areas of science, such as social and computer science and biology. The range of applications goes from modeling gene mutations and recommender systems to analyzing social networks (Becchetti et al., 2011; Casiraghi et al., 2016; Lodato et al., 2015). The hypergeometric distribution describes sampling without replacement and therefore models the number of samples per group given a limited number of total samples. Hence, it is essential wherever the choice of a single group element influences the probability of the remaining elements being drawn. Previous work mainly uses the hypergeometric distribution implicitly to model assumptions or as a tool to prove theorems.
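
For a flavour of reparameterized sampling without replacement, the sketch below uses an iterated Gumbel-Softmax with soft masking in PyTorch; this is a common relaxation assumed here only for illustration, not the noncentral-hypergeometric construction proposed in the paper, and the function name and temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def relaxed_sample_wor(logits, k, tau=0.5):
    """Differentiable relaxation of drawing k items without replacement,
    via iterated Gumbel-Softmax with soft masking of already-drawn items."""
    mask = torch.zeros_like(logits)
    samples = []
    for _ in range(k):
        u = torch.rand_like(logits).clamp_min(1e-9)
        g = -torch.log(-torch.log(u))                    # Gumbel noise
        y = F.softmax((logits + g + mask) / tau, dim=-1)  # soft one-hot draw
        samples.append(y)
        # Softly remove the selected item so it is unlikely to be drawn again.
        mask = mask + torch.log1p(-y.clamp(max=1 - 1e-6))
    return torch.stack(samples)

logits = torch.randn(6, requires_grad=True)
draws = relaxed_sample_wor(logits, k=3)
draws.sum().backward()                      # gradients flow back to the logits
print(draws.shape, logits.grad.shape)
```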